Ranking the {tidyverse} packages for bioinformatics analysis.

from MY perspective

Marcel Ferreira (@marceelrf)

Bioinformatics

  • Genetics and Genomics;

  • Collect, store, analyze, and disseminate biological information;

  • Transcriptomic and Proteomic data;

  • Differential expression and enrichment analysis.

The {tidyverse}

  • Hadley Wickham;

  • tidy data:

    1.  Every column is a variable.

    2.  Every row is an observation.

    3.  Every cell is a single value.

  • Designed for data science;

  • Each package has a common design philosophy, grammar, and data structures.

[Figure: tidyverse packages (source: Posit)]

9. {lubridate}

  • Newest member of the core tidyverse (since tidyverse 2.0.0);

  • Designed to handle date and time data;

  • These formats are not widely used in bioinformatics analysis;

  • factor.

8. {forcats}

  • Categorical variables;

  • Nearly all functions start with fct_;

  • Although extremely useful, I have never used it myself in my analysis pipelines.
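
One place where {forcats} can appear in a bioinformatics workflow is setting the reference level of a condition before modelling. The snippet below is a minimal sketch under that assumption; the `genotype` vector and its labels are hypothetical.

```r
library(forcats)

# Hypothetical sample annotation; by default, factor levels are sorted alphabetically
genotype <- factor(c("wildtype", "knockout", "wildtype", "knockout"))
levels(genotype)

# Move "wildtype" to the front so it acts as the reference level
genotype <- fct_relevel(genotype, "wildtype")
levels(genotype)

# Count the samples in each level
fct_count(genotype)
```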

7. {readr}

  • Data import and export;

  • CSV, TSV, TXT, and other delimited files;

  • Base R import functions are also effective and are often used internally by Bioconductor packages.
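
A minimal sketch of the import/export round trip, assuming readr 2.0 or later (where literal text can be passed with `I()`). The count table and column names are hypothetical.

```r
library(readr)

# Hypothetical tab-separated count table, supplied as literal text via I()
tsv_text <- I("gene\tctrl\ttreated\ngeneA\t10\t35\ngeneB\t250\t240\n")
counts <- read_tsv(tsv_text, show_col_types = FALSE)

# Export to CSV in a temporary file, then read it back
out_file <- tempfile(fileext = ".csv")
write_csv(counts, out_file)
counts_again <- read_csv(out_file, show_col_types = FALSE)
```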

6. {tibble}

  • Modern data.frame;

  • It’s simple to convert data.frames, lists, and vectors to tibble format;

  • tibble();

  • rowid_to_column(), rownames_to_column(), column_to_rownames(), enframe() and deframe().
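
The conversion helpers above matter because many Bioconductor objects keep gene IDs in row names, while tidyverse functions expect them in a column. A small sketch with hypothetical, illustrative values:

```r
library(tibble)

# Hypothetical expression table with gene IDs stored as row names
expr_df <- data.frame(
  ctrl = c(10, 250),
  treated = c(35, 240),
  row.names = c("TP53", "GAPDH")
)

# Move the row names into an explicit "gene" column and convert to a tibble
expr_tbl <- as_tibble(rownames_to_column(expr_df, var = "gene"))

# Convert back when a function expects gene IDs as row names
expr_back <- column_to_rownames(expr_tbl, var = "gene")

# enframe() turns a named vector into a two-column tibble; deframe() reverses it
gene_lengths <- c(TP53 = 1182, GAPDH = 1008)  # illustrative values
lengths_tbl <- enframe(gene_lengths, name = "gene", value = "cds_length")
lengths_vec <- deframe(lengths_tbl)
```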

5. {purrr}

  • Enhances R’s functional programming toolkit;

  • Lists and vectors;

  • map family;

  • flatten and reduce;

  • Combined with list-columns, it produces one of the most powerful analysis pipelines in R.
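
The list-column pattern can be sketched as follows: nest the data per gene, then map a model over each nested table. The data are made up, and the base R `t.test()` is only a stand-in for illustration, not a recommended differential expression method.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical long-format expression data: 2 genes x 2 groups x 3 replicates
expr_long <- tibble::tibble(
  gene = rep(c("geneA", "geneB"), each = 6),
  group = rep(rep(c("ctrl", "treated"), each = 3), times = 2),
  expression = c(5.1, 4.9, 5.0, 7.2, 7.0, 7.1,
                 3.0, 3.2, 3.1, 3.1, 2.9, 3.0)
)

per_gene <- expr_long %>%
  # One row per gene, with its measurements stored in a list-column
  nest(data = c(group, expression)) %>%
  mutate(
    # Fit one test per gene; the result is another list-column
    test = map(data, function(d) t.test(expression ~ group, data = d)),
    # Pull a single number out of each fitted object
    p_value = map_dbl(test, function(tt) tt$p.value)
  )
```

Each row of `per_gene` keeps the raw data, the fitted test, and the extracted p-value side by side, which is what makes the pattern convenient for per-gene or per-sample analyses.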

4. {stringr}

  • The package to deal with strings;

  • DNA, RNA, and proteins are biological strings;

  • Nearly all functions start with str_;

  • Regular expressions (regex).
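
A short sketch of treating sequences as strings. The two DNA sequences are invented for the example; dedicated packages such as Biostrings are usually a better fit for large sequence sets.

```r
library(stringr)

# Hypothetical DNA sequences
dna <- c(seq1 = "ATGCGTACGTTAG", seq2 = "ATGAAATTTGGGCCC")

# Sequence length and GC content (regex character class [GC])
seq_length <- str_length(dna)
gc_content <- str_count(dna, "[GC]") / seq_length

# Transcribe DNA to RNA by replacing every T with U
rna <- str_replace_all(dna, "T", "U")

# Check for a start codon at the beginning of each sequence
starts_with_atg <- str_detect(dna, "^ATG")

# Split the second sequence into non-overlapping codons
codons <- str_extract_all(dna[["seq2"]], "[ACGT]{3}")[[1]]
```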

3. {tidyr}

  • Functions to transform data into tidy format;

  • pivot_longer() and pivot_wider();

  • list-columns with nest() and unnest();

  • separate() and separate_rows();
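
Expression matrices usually come in wide format (one column per sample), which is not tidy. The sketch below reshapes a hypothetical table into one observation per row and back again; the sample names are assumed to follow a `group_replicate` pattern.

```r
library(tidyr)

# Hypothetical wide expression table: one column per sample
expr_wide <- tibble::tibble(
  gene = c("geneA", "geneB"),
  ctrl_1 = c(5.1, 3.0),
  ctrl_2 = c(4.9, 3.2),
  treated_1 = c(7.2, 3.1),
  treated_2 = c(7.0, 2.9)
)

# Wide to long: one row per gene-sample observation
expr_long <- pivot_longer(expr_wide, cols = -gene,
                          names_to = "sample", values_to = "expression")

# Split the sample name into its two variables
expr_long <- separate(expr_long, sample, into = c("group", "replicate"), sep = "_")

# Long back to wide, rebuilding the sample names as group_replicate
expr_back <- pivot_wider(expr_long, names_from = c(group, replicate),
                         values_from = expression)
```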

2. {ggplot2}

  • Data visualization;

  • Based on the grammar of graphics;

  • Many Bioconductor packages use ggplot2 internally;

  • More than 50 extensions;
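
As an illustration of the grammar of graphics, here is a minimal volcano plot sketch. The data are simulated random numbers, not real results, and the thresholds (p < 0.05, |log2FC| > 1) are assumptions chosen for the example.

```r
library(ggplot2)

# Simulated differential expression results (random, for illustration only)
set.seed(1)
de_results <- data.frame(
  log2fc = rnorm(200),
  p_value = runif(200)
)
de_results$significant <- de_results$p_value < 0.05 & abs(de_results$log2fc) > 1

# Layers: data + aesthetic mapping + geoms + labels + theme
volcano <- ggplot(de_results,
                  aes(x = log2fc, y = -log10(p_value), colour = significant)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  labs(x = "log2 fold change", y = "-log10(p-value)") +
  theme_minimal()
```

Printing `volcano` (or calling `ggsave()`) renders the figure.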

1. {dplyr}

  • Data wrangling;

  • Data manipulation;

  • Five verbs: filter(), arrange(), select(), mutate(), and summarise();

  • Joins;

  • SQL-like syntax.
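
The verbs and joins can be combined into a typical post-analysis step: filter significant genes, annotate them, and summarise. The tables, gene IDs, symbols, and the 0.05 cutoff below are all hypothetical.

```r
library(dplyr)

# Hypothetical differential expression results
de_results <- tibble::tibble(
  gene_id = c("g1", "g2", "g3", "g4"),
  log2fc = c(2.5, -0.3, -1.8, 0.9),
  padj = c(0.001, 0.80, 0.01, 0.20)
)

# Hypothetical annotation table
annotation <- tibble::tibble(
  gene_id = c("g1", "g2", "g3"),
  symbol = c("SYM1", "SYM2", "SYM3")
)

top_genes <- de_results %>%
  filter(padj < 0.05) %>%                                   # keep significant rows
  mutate(direction = if_else(log2fc > 0, "up", "down")) %>% # add a new column
  left_join(annotation, by = "gene_id") %>%                 # SQL-style join
  select(gene_id, symbol, log2fc, padj, direction) %>%      # choose and order columns
  arrange(padj)                                             # sort by adjusted p-value

# Count genes per direction
direction_counts <- top_genes %>%
  group_by(direction) %>%
  summarise(n_genes = n())
```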

Final list

  1. {dplyr}
  2. {ggplot2}
  3. {tidyr}
  4. {stringr}
  5. {purrr}
  6. {tibble}
  7. {readr}
  8. {forcats}
  9. {lubridate}